56 research outputs found
On Robust Face Recognition via Sparse Encoding: the Good, the Bad, and the Ugly
In the field of face recognition, Sparse Representation (SR) has received
considerable attention during the past few years. Most of the relevant
literature focuses on holistic descriptors in closed-set identification
applications. The underlying assumption in SR-based methods is that each class
in the gallery has sufficient samples and the query lies on the subspace
spanned by the gallery of the same class. Unfortunately, such assumption is
easily violated in the more challenging face verification scenario, where an
algorithm is required to determine if two faces (where one or both have not
been seen before) belong to the same person. In this paper, we first discuss
why previous attempts with SR might not be applicable to verification problems.
We then propose an alternative approach to face verification via SR.
Specifically, we propose to use explicit SR encoding on local image patches
rather than the entire face. The obtained sparse signals are pooled via
averaging to form multiple region descriptors, which are then concatenated to
form an overall face descriptor. Due to the deliberate loss of spatial relations
within each region (caused by averaging), the resulting descriptor is robust to
misalignment and various image deformations. Within the proposed framework, we
evaluate several SR encoding techniques: l1-minimisation, Sparse Autoencoder
Neural Network (SANN), and an implicit probabilistic technique based on
Gaussian Mixture Models. Thorough experiments on AR, FERET, exYaleB, BANCA and
ChokePoint datasets show that the proposed local SR approach obtains
considerably better and more robust performance than several previous
state-of-the-art holistic SR methods, in both verification and closed-set
identification problems. The experiments also show that l1-minimisation based
encoding has a considerably higher computational cost than the other techniques,
but leads to higher recognition rates.
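To make the pipeline concrete, below is a minimal sketch of the local SR descriptor described above: overlapping patches are extracted from each face region, sparse-encoded against a dictionary, average-pooled within the region, and the region descriptors are concatenated. The patch size, region grid, random dictionary, and the use of scikit-learn's sparse_encode for the l1-minimisation step are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch (not the authors' exact setup) of the local SR descriptor:
# sparse-encode overlapping patches per region, average-pool the codes, concatenate.
import numpy as np
from sklearn.decomposition import sparse_encode

def extract_patches(region, patch_size=8, step=4):
    """Collect normalised, overlapping patch vectors from a 2-D grayscale region."""
    h, w = region.shape
    patches = []
    for y in range(0, h - patch_size + 1, step):
        for x in range(0, w - patch_size + 1, step):
            p = region[y:y + patch_size, x:x + patch_size].ravel()
            patches.append((p - p.mean()) / (p.std() + 1e-8))
    return np.asarray(patches)

def region_descriptor(region, dictionary, alpha=0.1):
    """l1-minimisation encoding of patches, then average pooling of the codes,
    which deliberately discards spatial layout inside the region."""
    codes = sparse_encode(extract_patches(region), dictionary,
                          algorithm="lasso_lars", alpha=alpha)
    return np.abs(codes).mean(axis=0)

def face_descriptor(face, dictionary, grid=(3, 3)):
    """Split the face into regions, describe each region, and concatenate."""
    h, w = face.shape
    rh, rw = h // grid[0], w // grid[1]
    descs = [region_descriptor(face[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw], dictionary)
             for i in range(grid[0]) for j in range(grid[1])]
    return np.concatenate(descs)

# Toy usage: a random dictionary of 128 atoms for 8x8 (64-dim) patches.
rng = np.random.default_rng(0)
D = rng.standard_normal((128, 64))
D /= np.linalg.norm(D, axis=1, keepdims=True)
print(face_descriptor(rng.random((96, 96)), D).shape)  # (9 * 128,) = (1152,)
```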
G-softmax: Improving Intra-class Compactness and Inter-class Separability of Features
Intra-class compactness and inter-class separability are crucial indicators
to measure the effectiveness of a model to produce discriminative features,
where intra-class compactness indicates how close the features with the same
label are to each other and inter-class separability indicates how far away the
features with different labels are. In this work, we investigate intra-class
compactness and inter-class separability of features learned by convolutional
networks and propose a Gaussian-based softmax (G-softmax) function
that can effectively improve intra-class compactness and inter-class
separability. The proposed function is simple to implement and can easily
replace the softmax function. We evaluate the proposed G-softmax
function on classification datasets (i.e., CIFAR-10, CIFAR-100, and Tiny
ImageNet) and on multi-label classification datasets (i.e., MS COCO and
NUS-WIDE). The experimental results show that the proposed
G-softmax function improves the state-of-the-art models across all
evaluated datasets. In addition, analysis of the intra-class compactness and
inter-class separability demonstrates the advantages of the proposed function
over the softmax function, which is consistent with the performance
improvement. More importantly, we observe that high intra-class compactness and
inter-class separability are linearly correlated with average precision on MS
COCO and NUS-WIDE. This implies that improving intra-class compactness and
inter-class separability would improve average precision.
Comment: 15 pages, published in IEEE TNNLS
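As a rough illustration of the idea, the sketch below replaces each class logit with a learnable per-class Gaussian CDF value before the usual normalisation. This is an assumed, simplified reading of a "Gaussian-based softmax"; the paper's exact parameterisation may differ, and the per-class mean, standard deviation, and temperature are placeholders.

```python
# Illustrative sketch only: a "Gaussian-based softmax" that maps each logit
# through a learnable per-class Gaussian CDF before normalisation. The exact
# formulation in the paper may differ; mu, sigma and the scale are assumptions.
import torch
import torch.nn as nn

class GaussianSoftmax(nn.Module):
    def __init__(self, num_classes, scale=10.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_classes))         # per-class mean
        self.log_sigma = nn.Parameter(torch.zeros(num_classes))  # per-class log-std
        self.scale = scale                                        # temperature

    def forward(self, logits):
        sigma = self.log_sigma.exp()
        # Gaussian CDF of each logit under its class-specific distribution.
        cdf = 0.5 * (1.0 + torch.erf((logits - self.mu) / (sigma * 2 ** 0.5)))
        return torch.softmax(self.scale * cdf, dim=-1)

# Drop-in usage where a plain softmax layer would normally sit.
probs = GaussianSoftmax(num_classes=10)(torch.randn(4, 10))
print(probs.sum(dim=-1))  # each row sums to 1
```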
Video Storytelling: Textual Summaries for Events
Bridging vision and natural language is a longstanding goal in computer
vision and multimedia research. While earlier works focus on generating a
single-sentence description for visual content, recent works have studied
paragraph generation. In this work, we introduce the problem of video
storytelling, which aims at generating coherent and succinct stories for long
videos. Video storytelling introduces new challenges, mainly due to the
diversity of the story and the length and complexity of the video. We propose
novel methods to address the challenges. First, we propose a context-aware
framework for multimodal embedding learning, where we design a Residual
Bidirectional Recurrent Neural Network to leverage contextual information from
past and future. Second, we propose a Narrator model to discover the underlying
storyline. The Narrator is formulated as a reinforcement learning agent which
is trained by directly optimizing the textual metric of the generated story. We
evaluate our method on the Video Story dataset, a new dataset that we have
collected to enable the study. We compare our method with multiple
state-of-the-art baselines, and show that our method achieves better
performance in terms of both quantitative measures and a user study.
Comment: Published in IEEE Transactions on Multimedia
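For illustration, a residual bidirectional recurrent layer in the spirit of the context-aware embedding above can be sketched as follows; the GRU cell, layer sizes, and the linear projection are assumptions for the example, not the paper's exact architecture.

```python
# Rough sketch of a residual bidirectional recurrent layer for sequence
# embedding; the GRU cell, sizes and projection are assumptions for the example.
import torch
import torch.nn as nn

class ResidualBiGRU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, dim)  # map back to the input width

    def forward(self, x):
        out, _ = self.rnn(x)       # context from both past and future time steps
        return x + self.proj(out)  # residual connection around the recurrence

# Example: two sequences of 20 frame features, each 512-dimensional.
feats = torch.randn(2, 20, 512)
print(ResidualBiGRU(512, 256)(feats).shape)  # torch.Size([2, 20, 512])
```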
Automatic Classification of Human Epithelial Type 2 Cell Indirect Immunofluorescence Images using Cell Pyramid Matching
This paper describes a novel system for automatic classification of images
obtained from Anti-Nuclear Antibody (ANA) pathology tests on Human Epithelial
type 2 (HEp-2) cells using the Indirect Immunofluorescence (IIF) protocol. The
IIF protocol on HEp-2 cells has been the hallmark method to identify the
presence of ANAs, due to its high sensitivity and the large range of antigens
that can be detected. However, it suffers from numerous shortcomings, such as
being subjective as well as time and labour intensive. Computer Aided
Diagnostic (CAD) systems have been developed to address these problems, which
automatically classify a HEp-2 cell image into one of the known patterns (e.g.
speckled, homogeneous). Most of the existing CAD systems use handpicked
features to represent a HEp-2 cell image, which may only work in limited
scenarios. We propose a novel automatic cell image classification method termed
Cell Pyramid Matching (CPM), which comprises regional histograms of
visual words coupled with the Multiple Kernel Learning framework. We present a
study of several variations of generating histograms and show the efficacy of
the system on two publicly available datasets: the ICPR HEp-2 cell
classification contest dataset and the SNPHEp-2 dataset.
Comment: arXiv admin note: substantial text overlap with arXiv:1304.126
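The core of the regional histogram representation can be sketched as follows: local descriptors are assigned to visual words and per-cell histograms are concatenated over a coarse-to-fine grid. The descriptor extraction, codebook, and pyramid levels here are simplified placeholders rather than the paper's exact CPM configuration, and the Multiple Kernel Learning stage is omitted.

```python
# Simplified placeholder for the pyramid-of-histograms idea: assign local
# descriptors to visual words, then concatenate per-cell histograms over a
# coarse-to-fine grid. The MKL classification stage is omitted.
import numpy as np

def assign_visual_words(descriptors, codebook):
    """Nearest-codeword assignment for each local descriptor."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def cell_pyramid_histogram(positions, words, image_shape, codebook_size, levels=(1, 2)):
    """Concatenate normalised visual-word histograms over an L x L grid per level."""
    h, w = image_shape
    feats = []
    for L in levels:
        for i in range(L):
            for j in range(L):
                in_cell = ((positions[:, 0] // (h // L) == i) &
                           (positions[:, 1] // (w // L) == j))
                hist = np.bincount(words[in_cell], minlength=codebook_size).astype(float)
                feats.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(feats)

# Toy usage: 200 random 32-dim descriptors sampled from a 64x64 cell image.
rng = np.random.default_rng(0)
desc = rng.standard_normal((200, 32))
pos = rng.integers(0, 64, size=(200, 2))
codebook = rng.standard_normal((50, 32))
words = assign_visual_words(desc, codebook)
print(cell_pyramid_histogram(pos, words, (64, 64), 50).shape)  # (50 * (1 + 4),) = (250,)
```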
MCM: Multi-condition Motion Synthesis Framework for Multi-scenario
The objective of the multi-condition human motion synthesis task is to
incorporate diverse conditional inputs, encompassing various forms like text,
music, speech, and more. This endows the task with the capability to adapt
across multiple scenarios, such as text-to-motion and music-to-dance. While
existing research has primarily focused on single conditions, multi-condition
human motion generation remains underexplored.
In this paper, we address these challenges by introducing MCM, a novel paradigm
for motion synthesis that spans multiple scenarios under diverse conditions.
The MCM framework is able to integrate with any DDPM-like diffusion model to
accommodate multi-conditional information input while preserving its generative
capabilities. Specifically, MCM employs a two-branch architecture consisting of a
main branch and a control branch. The control branch shares the same structure
as the main branch and is initialized with the parameters of the main branch,
effectively maintaining the generation ability of the main branch and
supporting multi-condition input. We also introduce a Transformer-based
diffusion model MWNet (DDPM-like) as our main branch that can capture the
spatial complexity and inter-joint correlations in motion sequences through a
channel-dimension self-attention module. Quantitative comparisons demonstrate
that our approach achieves state-of-the-art results in text-to-motion and
competitive results in music-to-dance, comparable to task-specific methods.
Furthermore, the qualitative evaluation shows that MCM not only streamlines the
adaptation of methodologies originally designed for text-to-motion tasks to
domains such as music-to-dance and speech-to-gesture, eliminating the need for
extensive network re-configuration, but also enables effective multi-condition
modal control, realizing "once trained is motion need".
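A conceptual sketch of the two-branch idea is given below: the control branch is a structural copy of the main branch, initialised from its weights, and its output is added back through a zero-initialised projection so extra conditions can steer generation without degrading the main branch. The module names, the toy denoiser, and the additive fusion are illustrative assumptions, not the MCM/MWNet implementation.

```python
# Conceptual sketch of the two-branch design: the control branch copies the main
# branch's structure and weights, and its features are added back through a
# zero-initialised projection. Module names, the toy denoiser, and the additive
# fusion are assumptions for illustration, not the MCM/MWNet implementation.
import copy
import torch
import torch.nn as nn

class TwoBranchDenoiser(nn.Module):
    def __init__(self, main_branch):
        super().__init__()
        self.main = main_branch
        self.control = copy.deepcopy(main_branch)  # same structure, same initial weights
        # Zero-initialised projection so the control signal starts as a no-op.
        self.zero_proj = nn.Linear(main_branch.dim, main_branch.dim)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, x, t, main_cond, extra_cond=None):
        out = self.main(x, t, main_cond)
        if extra_cond is not None:
            out = out + self.zero_proj(self.control(x, t, extra_cond))
        return out

class TinyDenoiser(nn.Module):
    """Minimal stand-in for a DDPM-style motion denoiser (timestep embedding omitted)."""
    def __init__(self, dim=64):
        super().__init__()
        self.dim = dim
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x, t, cond):
        return self.net(x + cond)

# Toy usage: text drives the main branch, music drives the control branch.
model = TwoBranchDenoiser(TinyDenoiser())
x = torch.randn(2, 30, 64)        # noisy 30-frame motion latents
text_c = torch.randn(2, 1, 64)    # sequence-level text condition
music_c = torch.randn(2, 30, 64)  # per-frame music condition
print(model(x, t=None, main_cond=text_c, extra_cond=music_c).shape)  # (2, 30, 64)
```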
- …